Suggesting Sounds for Images from Video Collections
Authors
Abstract
Given a still image, humans can easily think of a sound associated with it. For instance, people might associate a picture of a car with the sound of a car engine. In this paper, we aim to retrieve sounds corresponding to a query image. To solve this challenging task, our approach exploits the correlation between the audio and visual modalities in video collections. A major difficulty is the large amount of uncorrelated audio in the videos, i.e., audio that does not correspond to the main image content, such as voice-over, background music, added sound effects, or sounds originating off-screen. We present an unsupervised, clustering-based solution that automatically separates correlated sounds from uncorrelated ones. The core algorithm is based on a joint audio-visual feature space, in which we perform iterated mutual kNN clustering to filter out uncorrelated sounds. We also introduce a new dataset of correlated audio-visual data, on which we evaluate our approach and compare it to alternative solutions. Experiments show that our approach can successfully deal with a large amount of uncorrelated audio.
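The abstract names the core idea (a joint audio-visual feature space plus iterated mutual kNN filtering) but gives no details, so the following is only a minimal sketch of that general idea, not the authors' algorithm. The function names, the L2-normalize-and-concatenate construction of the joint space, the Euclidean distance, and the parameters `k`, `min_mutual`, and `max_iter` are all assumptions made for illustration.

```python
import numpy as np


def joint_features(visual, audio):
    """Assumed joint space: L2-normalize each modality per row, then concatenate."""
    def l2_normalize(x):
        norms = np.linalg.norm(x, axis=1, keepdims=True)
        return x / np.maximum(norms, 1e-12)  # guard against zero vectors
    return np.hstack([l2_normalize(visual), l2_normalize(audio)])


def mutual_knn_adjacency(features, k):
    """Boolean matrix: entry (i, j) is True iff i and j are in each other's k nearest neighbours."""
    n = features.shape[0]
    diffs = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))  # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)             # a point is not its own neighbour
    k = min(k, n - 1)
    knn = np.zeros((n, n), dtype=bool)
    if k > 0:
        nearest = np.argsort(dist, axis=1)[:, :k]
        rows = np.repeat(np.arange(n), k)
        knn[rows, nearest.ravel()] = True
    return knn & knn.T                         # keep only reciprocal links


def iterated_mutual_knn_filter(features, k=3, min_mutual=1, max_iter=10):
    """Repeatedly drop samples with fewer than `min_mutual` mutual neighbours.

    Returns the indices (into `features`) of the samples that survive.
    """
    keep = np.arange(features.shape[0])
    for _ in range(max_iter):
        if keep.size == 0:
            break
        degree = mutual_knn_adjacency(features[keep], k).sum(axis=1)
        survivors = keep[degree >= min_mutual]
        if survivors.size == keep.size:        # nothing removed: converged
            break
        keep = survivors
    return keep
```

In this sketch, samples whose audio and visual descriptors agree with those of other samples tend to form tight groups with reciprocal neighbour links, while isolated samples (for example, a clip with unrelated background music) tend to have no mutual neighbours and are dropped. Re-running the neighbour search after each removal is what makes the filter "iterated".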
Similar resources
SIDF: A Novel Framework for Accurate Surgical Instrument Detection in Laparoscopic Video Frames
Background and Objectives: Identification of surgical instruments in laparoscopic video images has several biomedical applications. While several methods have been proposed for accurate detection of surgical instruments, the accuracy of these methods is still challenged by the high complexity of laparoscopic video images. This paper introduces a Surgical Instrument Detection Framework (SIDF) for a...
Extending SAR Image Despeckling Methods for ViSAR Denoising
Synthetic Aperture Radar (SAR) is widely used in different weather conditions for various applications such as mapping, remote sensing, and urban, civil, and military monitoring. Recently, a new radar sensor called Video SAR (ViSAR) has been developed to capture sequential frames of moving objects for environmental monitoring applications. As with SAR images, the major problem of ViSAR is the pres...
Multimedia Search Technologies
One of the most ubiquitous activities related to learning in the digital age is “search”. In recent years, computers have rapidly evolved from numeric and text processing to include multimedia, specifically audio, video, and images. However, few methods exist for searching multimedia, apart from text-based strategies operating on keywords, metadata, and filenames. Creating text descriptions for m...
Fast Robust Large-scale Mapping from Video and Internet Photo Collections
This paper presents a system approaching fully automatic 3D modeling of large-scale environments. Our system takes as input either a video stream or a collection of photographs obtained from Internet photo-sharing websites such as Flickr. The system achieves high computational performance through algorithmic optimizations for efficient robust estimation, the use of image-based recognition for eff...
Linking the sounds of dolphins to their locations and behavior using video and multichannel acoustic recordings.
It is difficult to attribute underwater animal sounds to the individuals producing them. This paper presents a system developed to solve this problem for dolphins by linking acoustic locations of the sounds of captive bottlenose dolphins with an overhead video image. A time-delay beamforming algorithm localized dolphin sounds obtained from an array of hydrophones dispersed around a lagoon. The ...